Validation


Roadmap 


(1) When Can Machines Learn?
(2) Why Can Machines Learn?
(3) How Can Machines Learn?
(4) How Can Machines Learn Better?


Lecture 14: Regularization 


minimizes augmented error, where the added 
regularizer effectively limits model complexity 












Lecture 15: Validation 
Model Selection Problem 
Validation 

Leave-One-Out Cross Validation 
V-Fold Cross Validation 
Validation Model Selection Problem


So Many Models Learned 


Even Just for Binary Classification ... 
$\mathcal{A} \in \{\text{PLA, pocket, linear regression, logistic regression}\}$
$\times$
$T \in \{100, 1000, 10000\}$
$\times$
$\eta \in \{1, 0.01, 0.0001\}$
$\times$
$\Phi \in \{\text{linear, quadratic, poly-10, Legendre-poly-10}\}$
$\times$
$\Omega(\mathbf{w}) \in \{\text{L2 regularizer, L1 regularizer, symmetry regularizer}\}$
$\times$
$\lambda \in \{0, 0.01, 1\}$

in addition to your favorite combination, may
need to try other combinations to get a good g
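To get a feel for how quickly the choices above multiply, the following minimal sketch enumerates every combination. The option values are copied from the slide; the dictionary keys are illustrative names, not part of the lecture.

```python
from itertools import product

# Candidate settings copied from the slide; the keys are illustrative names.
choices = {
    "algorithm": ["PLA", "pocket", "linear regression", "logistic regression"],
    "iterations_T": [100, 1000, 10000],
    "step_size_eta": [1, 0.01, 0.0001],
    "transform_Phi": ["linear", "quadratic", "poly-10", "Legendre-poly-10"],
    "regularizer_Omega": ["L2", "L1", "symmetry"],
    "lambda": [0, 0.01, 1],
}

# Every combination is one candidate model (H_m, A_m).
models = [dict(zip(choices, combo)) for combo in product(*choices.values())]
print(len(models))  # 4 * 3 * 3 * 4 * 3 * 3 = 1296
```

Even this small grid already yields over a thousand candidate models, which is why a principled selection rule is needed.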


Model Selection Problem 


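which one do you prefer? :-)

• given: M models $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_M$, each with corresponding
  algorithm $\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_M$
• goal: select $\mathcal{H}_{m^*}$ such that $g_{m^*} = \mathcal{A}_{m^*}(\mathcal{D})$ is of low $E_{\text{out}}(g_{m^*})$
• unknown $E_{\text{out}}$ due to unknown $P(\mathbf{x})$ & $P(y|\mathbf{x})$, as always :-)
• arguably the most important practical problem of ML

how to select? visually?
no, remember Lecture 12? :-)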


Model Selection by Best Ein 


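select by best $E_{\text{in}}$?

$$m^* = \operatorname*{argmin}_{1 \le m \le M} \left( E_m = E_{\text{in}}(\mathcal{A}_m(\mathcal{D})) \right)$$

• $\Phi_{1126}$ always more preferred over $\Phi_1$;
  $\lambda = 0$ always more preferred over $\lambda = 0.1$: overfitting?
• if $\mathcal{A}_1$ minimizes $E_{\text{in}}$ over $\mathcal{H}_1$ and $\mathcal{A}_2$ minimizes $E_{\text{in}}$ over $\mathcal{H}_2$,
  $\Longrightarrow$ $g_{m^*}$ achieves minimal $E_{\text{in}}$ over $\mathcal{H}_1 \cup \mathcal{H}_2$
  $\Longrightarrow$ 'model selection + learning' pays $d_{\text{VC}}(\mathcal{H}_1 \cup \mathcal{H}_2)$: bad generalization?

selecting by $E_{\text{in}}$ is dangerous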
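The danger can be seen in a small experiment. This is a sketch under assumptions: the noisy sine target, the sample size, and the polynomial degrees are all hypothetical choices, and least-squares polynomial fitting stands in for the learning algorithms. Because the hypothesis sets are nested, the in-sample error can only go down as the degree grows, so selection by $E_{\text{in}}$ picks the most complex model.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
# Hypothetical noisy target, chosen only for illustration.
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(20)

def e_in(degree):
    """Squared in-sample error of the least-squares polynomial of this degree."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

degrees = [1, 2, 3, 4, 5]
errors = {d: e_in(d) for d in degrees}
best_by_ein = min(errors, key=errors.get)
print(best_by_ein)  # the largest degree wins, regardless of the true target
```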


Model Selection by Best Etest 


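select by best $E_{\text{test}}$, which is
evaluated on a fresh $\mathcal{D}_{\text{test}}$?

$$m^* = \operatorname*{argmin}_{1 \le m \le M} \left( E_m = E_{\text{test}}(\mathcal{A}_m(\mathcal{D})) \right)$$

• generalization guarantee (finite-bin Hoeffding):
  $$E_{\text{out}}(g_{m^*}) \le E_{\text{test}}(g_{m^*}) + O\left(\sqrt{\frac{\log M}{N_{\text{test}}}}\right)$$
  yes! strong guarantee :-)
• but where is $\mathcal{D}_{\text{test}}$? your boss's safe, maybe? :-(

selecting by $E_{\text{test}}$ is infeasible and cheating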
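The size of the penalty term in the finite-bin Hoeffding guarantee can be computed directly. The sketch below solves $2M\exp(-2\epsilon^2 N) \le \delta$ for $\epsilon$; the function name and the default confidence level `delta=0.05` are assumptions made for illustration.

```python
import math

def hoeffding_epsilon(num_models, num_test, delta=0.05):
    """Epsilon such that, with probability at least 1 - delta, all of the
    num_models hypotheses satisfy |E_test - E_out| <= epsilon.
    Solves the finite-bin Hoeffding bound 2 * M * exp(-2 * eps^2 * N) <= delta."""
    return math.sqrt(math.log(2 * num_models / delta) / (2 * num_test))

# The penalty grows only logarithmically in M and shrinks as 1 / sqrt(N).
for m in (1, 10, 1000):
    print(m, round(hoeffding_epsilon(m, num_test=1000), 4))
```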


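Comparison between $E_{\text{in}}$ and $E_{\text{test}}$

in-sample error $E_{\text{in}}$
• calculated from $\mathcal{D}$
• feasible on hand
• 'contaminated' as $\mathcal{D}$ also used by $\mathcal{A}_m$ to 'select' $g_m$

test error $E_{\text{test}}$
• calculated from $\mathcal{D}_{\text{test}}$
• infeasible in boss's safe
• 'clean' as $\mathcal{D}_{\text{test}}$ never used for selection before

something in between: $E_{\text{val}}$
• calculated from $\mathcal{D}_{\text{val}} \subset \mathcal{D}$
• feasible on hand
• 'clean' if $\mathcal{D}_{\text{val}}$ never used by $\mathcal{A}_m$ before

selecting by $E_{\text{val}}$: legal cheating :-)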


Fun Time 
For $\mathcal{X} = \mathbb{R}^d$, consider two hypothesis sets, $\mathcal{H}_+$ and $\mathcal{H}_-$. The first
hypothesis set contains all perceptrons with $w_1 \ge 0$, and the second
hypothesis set contains all perceptrons with $w_1 \le 0$. Denote $g_+$ and $g_-$
as the minimum-$E_{\text{in}}$ hypothesis in each hypothesis set, respectively.
Which statement below is true?
(1) If $E_{\text{in}}(g_+) < E_{\text{in}}(g_-)$, then $g_+$ is the minimum-$E_{\text{in}}$ hypothesis of all
    perceptrons in $\mathbb{R}^d$.
(2) If $E_{\text{test}}(g_+) < E_{\text{test}}(g_-)$, then $g_+$ is the minimum-$E_{\text{test}}$ hypothesis of
    all perceptrons in $\mathbb{R}^d$.
(3) The two hypothesis sets are disjoint.
(4) None of the above
Reference Answer: (1)

Note that the two hypothesis sets are not
disjoint (sharing '$w_1 = 0$' perceptrons) but their
union is all perceptrons.
Validation Validation


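Validation Set $\mathcal{D}_{\text{val}}$

$\mathcal{D}$ (size $N$) $\rightarrow$ $\mathcal{D}_{\text{train}}$ (size $N-K$) $\cup$ $\mathcal{D}_{\text{val}}$ (size $K$)
$E_{\text{in}}(h)$ is measured on $\mathcal{D}$; $E_{\text{val}}(h)$ is measured on $\mathcal{D}_{\text{val}}$
$g_m = \mathcal{A}_m(\mathcal{D})$; $g_m^- = \mathcal{A}_m(\mathcal{D}_{\text{train}})$

• $\mathcal{D}_{\text{val}} \subset \mathcal{D}$: called validation set, 'on-hand' simulation of test set
• to connect $E_{\text{val}}$ with $E_{\text{out}}$:
  $\mathcal{D}_{\text{val}} \overset{iid}{\sim} P(\mathbf{x}, y) \Longleftarrow$ select $K$ examples from $\mathcal{D}$ at random
• to make sure $\mathcal{D}_{\text{val}}$ 'clean':
  feed only $\mathcal{D}_{\text{train}}$ to $\mathcal{A}_m$ for model selection

$$E_{\text{out}}(g_m^-) \le E_{\text{val}}(g_m^-) + O\left(\sqrt{\frac{\log M}{K}}\right)$$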
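The random split described above can be sketched in a few lines. The function name `split_train_val`, the seed, and the toy data are assumptions for illustration; only the idea of drawing $K$ examples at random comes from the slide.

```python
import random

def split_train_val(data, k, seed=0):
    """Randomly pick k examples as D_val; the remaining N - k form D_train."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    val = [data[i] for i in indices[:k]]
    train = [data[i] for i in indices[k:]]
    return train, val

data = [(x, 2 * x + 1) for x in range(10)]  # hypothetical (x, y) pairs
d_train, d_val = split_train_val(data, k=3)
print(len(d_train), len(d_val))  # 7 3
```

Keeping the two parts disjoint is what makes $\mathcal{D}_{\text{val}}$ 'clean': no algorithm sees a validation example during training.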


Model Selection by Best Eval 





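$$m^* = \operatorname*{argmin}_{1 \le m \le M} \left( E_m = E_{\text{val}}(\mathcal{A}_m(\mathcal{D}_{\text{train}})) \right)$$

• generalization guarantee for all $m$:
  $$E_{\text{out}}(g_m^-) \le E_{\text{val}}(g_m^-) + O\left(\sqrt{\frac{\log M}{K}}\right)$$
• heuristic gain from $N-K$ to $N$:
  $$E_{\text{out}}\left(g_{m^*} = \mathcal{A}_{m^*}(\mathcal{D})\right) \le E_{\text{out}}\left(g_{m^*}^- = \mathcal{A}_{m^*}(\mathcal{D}_{\text{train}})\right)$$
  learning curve, remember? :-)

[Figure: $\mathcal{D}_{\text{train}}$ is fed to $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_M$ to get $g_1^-, g_2^-, \ldots, g_M^-$; each is scored on $\mathcal{D}_{\text{val}}$ to get $E_1, E_2, \ldots, E_M$; pick the best $(\mathcal{H}_{m^*}, E_{m^*})$, then train on $\mathcal{D}$ to get $g_{m^*}$]

$$E_{\text{out}}(g_{m^*}) \le E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\left(\sqrt{\frac{\log M}{K}}\right)$$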
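The whole procedure (train each model on $\mathcal{D}_{\text{train}}$, pick the best by $E_{\text{val}}$, then retrain the winner on all of $\mathcal{D}$) might look like the sketch below. Polynomial degrees stand in for the models $\mathcal{H}_m$, and the data, function name, and squared error are assumptions chosen for illustration.

```python
import numpy as np

def select_by_validation(x, y, degrees, k, seed=0):
    """Pick a polynomial degree by E_val, then retrain the winner on all data."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    val, train = perm[:k], perm[k:]
    e_val = {}
    for d in degrees:
        g_minus = np.polyfit(x[train], y[train], d)       # g_m^- = A_m(D_train)
        e_val[d] = float(np.mean((np.polyval(g_minus, x[val]) - y[val]) ** 2))
    m_star = min(e_val, key=e_val.get)                    # argmin of E_val
    g_star = np.polyfit(x, y, m_star)                     # g_{m*} = A_{m*}(D)
    return m_star, g_star, e_val

# Hypothetical data: a noisy straight line.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 50)
y = 1 + 2 * x + 0.1 * rng.standard_normal(50)
m_star, g_star, e_val = select_by_validation(x, y, degrees=[0, 1, 2, 3], k=10)
print(m_star, e_val)
```

Returning the retrained `g_star` rather than the validation-time fit is the "heuristic gain from $N-K$ to $N$" in code form.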


Validation in Practice 


use validation to select between $\mathcal{H}_{\Phi_5}$ and $\mathcal{H}_{\Phi_{10}}$

[Figure: Expected $E_{\text{out}}$ versus Validation Set Size, $K$; curves labeled in-sample: $g_{\hat{m}}$, optimal, validation: $g_{m^*}^-$, and validation: $g_{m^*}$]

• in-sample: selection with $E_{\text{in}}$
• optimal: cheating-selection with $E_{\text{test}}$
• sub-g: selection with $E_{\text{val}}$ and report $g_{m^*}^-$
• full-g: selection with $E_{\text{val}}$ and report $g_{m^*}$
  $E_{\text{out}}(g_{m^*}) \le E_{\text{out}}(g_{m^*}^-)$ indeed

why is sub-g worse than in-sample some time?


The Dilemma about K 
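reasoning of validation:

$$E_{\text{out}}(g) \underset{\text{(small } K\text{)}}{\approx} E_{\text{out}}(g^-) \underset{\text{(large } K\text{)}}{\approx} E_{\text{val}}(g^-)$$

• large $K$: every $E_{\text{val}} \approx E_{\text{out}}$,
  but all $g_m^-$ much worse than $g_m$
• small $K$: every $g_m^- \approx g_m$,
  but $E_{\text{val}}$ far from $E_{\text{out}}$

[Figure: Expected $E_{\text{out}}$ versus Validation Set Size, $K$; curves labeled in-sample: $g_{\hat{m}}$, validation: $g_{m^*}^-$, and validation: $g_{m^*}$]

practical rule of thumb: $K = \frac{N}{5}$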




Fun Time 


For a learning model that takes $N^2$ seconds of training when using $N$
examples, what is the total amount of seconds needed when running
the whole validation procedure with $K = \frac{N}{5}$ on 25 such models with
different parameters to get the final $g_{m^*}$?

(1) $6N^2$
(2) $17N^2$
(3) $25N^2$
(4) $26N^2$
Reference Answer: (2)

To get all the $g_m^-$, we need $\frac{16}{25} N^2 \cdot 25$ seconds.
Then to get $g_{m^*}$, we need another $N^2$ seconds.
So in total we need $17N^2$ seconds.
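The arithmetic in the answer can be checked with exact fractions. This is a small sketch; the function name is hypothetical, and the cost is expressed in units of $N^2$ seconds under the stated assumption that training on $m$ examples takes $m^2$ seconds.

```python
from fractions import Fraction

def validation_cost_in_n_squared(num_models, val_fraction):
    """Total training time in units of N^2 seconds; val_fraction is K / N."""
    train_fraction = 1 - Fraction(val_fraction)
    select_cost = num_models * train_fraction ** 2  # each g_m^- uses N - K examples
    retrain_cost = 1                                # g_{m*} uses all N examples
    return select_cost + retrain_cost

print(validation_cost_in_n_squared(25, Fraction(1, 5)))  # 17
```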
Validation Leave-One-Out Cross Validation


Extreme Case: K = 1 
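reasoning of validation:

$$E_{\text{out}}(g) \underset{\text{(small } K\text{)}}{\approx} E_{\text{out}}(g^-) \underset{\text{(large } K\text{)}}{\approx} E_{\text{val}}(g^-)$$

• take $K = 1$? $\mathcal{D}_{\text{val}}^{(n)} = \{(\mathbf{x}_n, y_n)\}$ and $E_{\text{val}}^{(n)}(g_n^-) = \text{err}(g_n^-(\mathbf{x}_n), y_n) = e_n$
• make $e_n$ closer to $E_{\text{out}}(g)$? average over possible $E_{\text{val}}^{(n)}$
• leave-one-out cross validation estimate:
  $$E_{\text{loocv}}(\mathcal{H}, \mathcal{A}) = \frac{1}{N} \sum_{n=1}^{N} e_n = \frac{1}{N} \sum_{n=1}^{N} \text{err}(g_n^-(\mathbf{x}_n), y_n)$$

hope: $E_{\text{loocv}}(\mathcal{H}, \mathcal{A}) \approx E_{\text{out}}(g)$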
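The estimate translates almost literally into code. Below is a minimal sketch in which `train` plays the role of $\mathcal{A}$ and `error` plays the role of err; the mean predictor, the squared error, and the toy examples are assumptions chosen only to make the example runnable.

```python
def loocv(examples, train, error):
    """Average of e_n = err(g_n^-(x_n), y_n), where g_n^- is trained
    on all examples except the n-th."""
    total = 0.0
    for n, (x_n, y_n) in enumerate(examples):
        rest = examples[:n] + examples[n + 1:]
        g_minus = train(rest)
        total += error(g_minus(x_n), y_n)
    return total / len(examples)

def train_mean(examples):
    """Hypothetical algorithm: always predict the mean y of the training data."""
    mean_y = sum(y for _, y in examples) / len(examples)
    return lambda x: mean_y

def squared_error(prediction, y):
    return (prediction - y) ** 2

examples = [(0.0, 2.0), (1.0, 4.0), (2.0, 9.0)]  # hypothetical data
print(loocv(examples, train_mean, squared_error))
```

Note that the loop trains $N$ separate hypotheses, which is the computational cost the estimate pays for using almost all data each time.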




Illustration of Leave-One-Out 




















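$$E_{\text{loocv}}(\text{linear}) = \frac{1}{3}(e_1 + e_2 + e_3)$$

$$E_{\text{loocv}}(\text{constant}) = \frac{1}{3}(e_1 + e_2 + e_3)$$

which one would you choose?
$$m^* = \operatorname*{argmin}_{1 \le m \le M} \left( E_m = E_{\text{loocv}}(\mathcal{H}_m, \mathcal{A}_m) \right)$$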
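As a concrete instance of choosing between a constant and a linear model with three points, consider the sketch below. The three $(x, y)$ values are hypothetical, and least-squares polynomial fitting of degree 0 and 1 is assumed as the learning algorithm for each model.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])  # hypothetical inputs
y = np.array([1.0, 5.0, 7.0])  # hypothetical outputs

def e_loocv(degree):
    """Leave-one-out squared error of a least-squares polynomial fit."""
    errors = []
    for n in range(len(x)):
        keep = np.arange(len(x)) != n
        coeffs = np.polyfit(x[keep], y[keep], degree)
        errors.append((np.polyval(coeffs, x[n]) - y[n]) ** 2)
    return float(np.mean(errors))

scores = {"constant": e_loocv(0), "linear": e_loocv(1)}
best = min(scores, key=scores.get)
print(scores, best)
```

On this data the linear model has the smaller leave-one-out error (3 versus 14), so it would be selected.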








Theoretical Guarantee of Leave-One-Out Estimate 
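does $E_{\text{loocv}}(\mathcal{H}, \mathcal{A})$ say something about $E_{\text{out}}(g)$?
yes, for average $E_{\text{out}}$ on size-$(N-1)$ data

$$
\begin{aligned}
\mathcal{E}_{\mathcal{D}}\, E_{\text{loocv}}(\mathcal{H}, \mathcal{A})
&= \mathcal{E}_{\mathcal{D}}\, \frac{1}{N} \sum_{n=1}^{N} e_n
 = \frac{1}{N} \sum_{n=1}^{N} \mathcal{E}_{\mathcal{D}}\, e_n \\
&= \frac{1}{N} \sum_{n=1}^{N} \mathcal{E}_{\mathcal{D}_n}\, \mathcal{E}_{(\mathbf{x}_n, y_n)}\, \text{err}(g_n^-(\mathbf{x}_n), y_n) \\
&= \frac{1}{N} \sum_{n=1}^{N} \mathcal{E}_{\mathcal{D}_n}\, E_{\text{out}}(g_n^-) \\
&= \frac{1}{N} \sum_{n=1}^{N} \overline{E_{\text{out}}}(N-1)
 = \overline{E_{\text{out}}}(N-1)
\end{aligned}
$$

expected $E_{\text{loocv}}(\mathcal{H}, \mathcal{A})$ says something about expected $E_{\text{out}}(g^-)$,
often called 'almost unbiased estimate of $E_{\text{out}}(g)$'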




Leave-One-Out in Practice 





[Figure: Error versus # Features Used]

[Figure: scatter plots of Symmetry versus Average Intensity (legend: Not 1), with panels titled 'select by $E_{\text{in}}$' and 'select by $E_{\text{loocv}}$']

$E_{\text{loocv}}$ much better than $E_{\text{in}}$


Fun Time 


Consider three examples $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), (\mathbf{x}_3, y_3)$ with $y_1 = 1$, $y_2 = 5$,
$y_3 = 7$. If we use $E_{\text{loocv}}$ to estimate the performance of a learning
algorithm that predicts with the average $y$ value of the data set (the
optimal constant prediction with respect to the squared error), what is
$E_{\text{loocv}}$ (squared error) of the algorithm?
(1) $0$
(2) $\frac{56}{9}$
(3) $\frac{60}{9}$
(4) $14$
Reference Answer: (4)

This is based on a simple calculation of
$e_1 = (1 - 6)^2 = 25$, $e_2 = (5 - 4)^2 = 1$, $e_3 = (7 - 3)^2 = 16$,
so $E_{\text{loocv}} = \frac{1}{3}(25 + 1 + 16) = 14$.








Hsuan-Tien Lin (NTU CSIE), Machine Learning Foundations


Validation V-Fold Cross Validation 


Disadvantages of Leave-One-Out Estimate 
Computation 








$$E_{\text{loocv}}(\mathcal{H}, \mathcal{A}) = \frac{1}{N} \sum_{n=1}^{N} e_n = \frac{1}{N} \sum_{n=1}^{N} \text{err}(g_n^-(\mathbf{x}_n), y_n)$$

• $N$ 'additional' trainings per model, not always feasible in practice
• except 'special case' like analytic solution for linear regression

Stability: due to variance of single-point estimates

[Figure: Error versus # Features Used]

$E_{\text{loocv}}$: not often used practically
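For the linear regression special case, one well-known analytic shortcut computes every leave-one-out residual from a single fit: with hat matrix $H = X X^{\dagger}$, the left-out residual is $(y_n - \hat{y}_n) / (1 - H_{nn})$. The slide does not spell out which analytic solution it has in mind, so treat the sketch below as one possible instance under that assumption; the data and function names are hypothetical.

```python
import numpy as np

def loocv_linear_regression(X, y):
    """E_loocv for least-squares linear regression with squared error,
    computed from one fit on all N examples via the hat matrix."""
    H = X @ np.linalg.pinv(X)               # hat matrix: y_hat = H y
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))
    return float(np.mean(loo_residuals ** 2))

def loocv_brute_force(X, y):
    """Reference version: retrain N times, each time leaving one example out."""
    errors = []
    for n in range(len(y)):
        keep = np.arange(len(y)) != n
        w, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errors.append((X[n] @ w - y[n]) ** 2)
    return float(np.mean(errors))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.standard_normal((30, 2))])  # hypothetical data
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(30)
print(np.isclose(loocv_linear_regression(X, y), loocv_brute_force(X, y)))  # True
```

The shortcut replaces $N$ trainings with one, which is why linear regression escapes the computational disadvantage.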


V-fold Cross Validation 
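how to decrease computation need for cross validation?

• essence of leave-one-out cross validation: partition $\mathcal{D}$ to $N$ parts,
  taking $N - 1$ for training and 1 for validation orderly
• V-fold cross-validation: random-partition of $\mathcal{D}$ to $V$ equal parts,
  take $V - 1$ for training and 1 for validation orderly

  $\mathcal{D} = \mathcal{D}_1\ \mathcal{D}_2\ \mathcal{D}_3\ \mathcal{D}_4\ \mathcal{D}_5\ \mathcal{D}_6\ \mathcal{D}_7\ \mathcal{D}_8\ \mathcal{D}_9\ \mathcal{D}_{10}$ (one part to validate, the others to train)

  $$E_{\text{cv}}(\mathcal{H}, \mathcal{A}) = \frac{1}{V} \sum_{v=1}^{V} E_{\text{val}}^{(v)}(g_v^-)$$
• selection by $E_{\text{cv}}$: $m^* = \operatorname*{argmin}_{1 \le m \le M} \left( E_m = E_{\text{cv}}(\mathcal{H}_m, \mathcal{A}_m) \right)$

practical rule of thumb: $V = 10$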
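A minimal sketch of V-fold cross validation follows, assuming polynomial least squares as the learning algorithm and squared error as the measure; the data and function name are hypothetical. `np.array_split` is used so the folds may differ in size by one when $V$ does not divide $N$.

```python
import numpy as np

def v_fold_cv(x, y, degree, v=10, seed=0):
    """E_cv: average validation error over v random folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), v)
    fold_errors = []
    for val in folds:
        train = np.setdiff1d(np.arange(len(x)), val)      # the other v - 1 folds
        g_minus = np.polyfit(x[train], y[train], degree)
        fold_errors.append(np.mean((np.polyval(g_minus, x[val]) - y[val]) ** 2))
    return float(np.mean(fold_errors))

# Hypothetical data: a noisy straight line.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 50)
y = 1 + 2 * x + 0.1 * rng.standard_normal(50)
e_cv = {d: v_fold_cv(x, y, d) for d in [0, 1, 2, 3]}
m_star = min(e_cv, key=e_cv.get)
print(m_star, e_cv)
```

Setting `v` equal to the number of examples makes every fold a single point, which recovers leave-one-out cross validation.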


Final Words on Validation 





‘Selecting’ Validation Tool 


• V-Fold generally preferred over single validation if computation
  allows
• 5-Fold or 10-Fold generally works well:
  not necessary to trade V-Fold with Leave-One-Out

Nature of Validation
• all training models: select among hypotheses
• all validation schemes: select among finalists
• all testing methods: just evaluate

validation still more optimistic than testing

do not fool yourself and others :-),
report test result, not best validation result
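The optimism of the best validation result can be illustrated with a simulation. In the sketch below, which is a made-up setup rather than anything from the lecture, every one of 200 models is a coin flip with $E_{\text{out}} = 0.5$, so any apparent advantage of the selected model comes purely from selection.

```python
import numpy as np

rng = np.random.default_rng(0)
num_models, k, n_test = 200, 100, 10000

# Hypothetical setup: each model guesses at random, so each has E_out = 0.5.
# Its validation error is the fraction of mistakes on k validation examples.
val_errors = rng.binomial(k, 0.5, size=num_models) / k
best = int(np.argmin(val_errors))
best_val_error = float(val_errors[best])

# A fresh test set for the selected model; it is still a coin flip.
test_error = rng.binomial(n_test, 0.5) / n_test
print(best_val_error, test_error)
```

The best validation error looks clearly better than 0.5, while the test error of the same model stays near 0.5, which is the reason to report the test result.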


Fun Time 


For a learning model that takes $N^2$ seconds of training when using $N$
examples, what is the total amount of seconds needed when running
10-fold cross validation on 25 such models with different parameters to
get the final $g_{m^*}$?

(1) $\frac{47}{2} N^2$
(2) $47N^2$
(3) $\frac{407}{2} N^2$
(4) $407N^2$
Reference Answer: (3)

To get all the $E_{\text{cv}}$, we need $\frac{81}{100} N^2 \cdot 10 \cdot 25$
seconds. Then to get $g_{m^*}$, we need another $N^2$
seconds. So in total we need $\frac{407}{2} N^2$ seconds.
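The same count can be verified with exact fractions. The function name is hypothetical, and the cost is expressed in units of $N^2$ seconds under the stated assumption that training on $m$ examples takes $m^2$ seconds.

```python
from fractions import Fraction

def v_fold_cost_in_n_squared(num_models, v):
    """Total training time in units of N^2 seconds for V-fold cross validation."""
    train_fraction = Fraction(v - 1, v)                 # each fold trains on (V-1)/V of N
    select_cost = num_models * v * train_fraction ** 2  # V trainings per model
    retrain_cost = 1                                    # final g_{m*} uses all N examples
    return select_cost + retrain_cost

print(v_fold_cost_in_n_squared(25, 10))  # 407/2
```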
Summary 
(1) When Can Machines Learn?
(2) Why Can Machines Learn?
(3) How Can Machines Learn?
(4) How Can Machines Learn Better?

Lecture 14: Regularization

Lecture 15: Validation
• Model Selection Problem
  dangerous by $E_{\text{in}}$ and dishonest by $E_{\text{test}}$
• Validation
  select with $E_{\text{val}}(\mathcal{A}_m(\mathcal{D}_{\text{train}}))$ while returning $\mathcal{A}_{m^*}(\mathcal{D})$
• Leave-One-Out Cross Validation
  huge computation for almost unbiased estimate
• V-Fold Cross Validation
  reasonable computation and performance

• next: something 'up my sleeve'